Method of improving recognition accuracy in form-based data entry systems



The present invention relates to methods of improving recognition accuracy in the area of
interpreting data entered into a form-based data entry system.

Background to the Invention

Many different systems require a user to interact and to provide data via one or more
different means. On-line systems include those found on Internet web pages, and off-line
systems include hand-written form creation where the hand-written forms are later scanned
and interpreted by a suitable apparatus. Other on-line systems include voice recognition
systems where a user is prompted to speak in response to a particular prompt.

Problems with such data input systems, also known as natural language systems, include
noise and ambiguity, with different users speaking, writing or otherwise entering data in an
inconsistent manner.



Various methods, systems and apparatus relating to the present invention are disclosed in
the following co-pending applications filed by the applicant or assignee of the present
invention. The disclosures of all of these co-pending applications are incorporated herein
by cross-reference.

5 October 2002: Australian Provisional Application 2002952259 "Methods and Apparatus
(NPT019)".

15 October 2002: PCT/AU02/01391, PCT/AU02/01392, PCT/AU02/01393,
PCT/AU02/01394 and PCT/AU02/01395.

26 November 2001: PCT/AU01/01527, PCT/AU01/01528, PCT/AU01/01529,
PCT/AU01/01530 and PCT/AU01/01531.

11 October 2001: PCT/AU01/01274.

27 November 2000: PCT/AU00/01442, PCT/AU00/01444, PCT/AU00/01446,
PCT/AU00/01445, PCT/AU00/01450, PCT/AU00/01453, PCT/AU00/01448,
PCT/AU00/01447, PCT/AU00/01459, PCT/AU00/01451, PCT/AU00/01454,
PCT/AU00/01452, PCT/AU00/01443, PCT/AU00/01455, PCT/AU00/01456,
PCT/AU00/01457, PCT/AU00/01458 and PCT/AU00/01449.

20 October 2000: PCT/AU00/01273, PCT/AU00/01279, PCT/AU00/01288,
PCT/AU00/01282, PCT/AU00/01276, PCT/AU00/01280, PCT/AU00/01274,
PCT/AU00/01289, PCT/AU00/01275, PCT/AU00/01277, PCT/AU00/01286,
PCT/AU00/01281, PCT/AU00/01278, PCT/AU00/01287, PCT/AU00/01285,
PCT/AU00/01284 and PCT/AU00/01283.

15 September 2000: PCT/AU00/01108, PCT/AU00/01110 and PCT/AU00/01111.

30 June 2000: PCT/AU00/00762, PCT/AU00/00763, PCT/AU00/00761,
PCT/AU00/00760, PCT/AU00/00759, PCT/AU00/00758, PCT/AU00/00764,
PCT/AU00/00765, PCT/AU00/00766, PCT/AU00/00767, PCT/AU00/00768,
PCT/AU00/00773, PCT/AU00/00774, PCT/AU00/00775, PCT/AU00/00776,
PCT/AU00/00777, PCT/AU00/00770, PCT/AU00/00769, PCT/AU00/00771,
PCT/AU00/00772, PCT/AU00/00754, PCT/AU00/00755, PCT/AU00/00756 and
PCT/AU00/00757.

24 May 2000: PCT/AU00/00518, PCT/AU00/00519, PCT/AU00/00520,
PCT/AU00/00521, PCT/AU00/00522, PCT/AU00/00523, PCT/AU00/00524,
PCT/AU00/00525, PCT/AU00/00526, PCT/AU00/00527, PCT/AU00/00528,
PCT/AU00/00529, PCT/AU00/00530, PCT/AU00/00531, PCT/AU00/00532,
PCT/AU00/00533, PCT/AU00/00534, PCT/AU00/00535, PCT/AU00/00536,
PCT/AU00/00537, PCT/AU00/00538, PCT/AU00/00539, PCT/AU00/00540,
PCT/AU00/00541, PCT/AU00/00542, PCT/AU00/00543, PCT/AU00/00544,
PCT/AU00/00545, PCT/AU00/00547, PCT/AU00/00546, PCT/AU00/00554,
PCT/AU00/00556, PCT/AU00/00557, PCT/AU00/00558, PCT/AU00/00559,
PCT/AU00/00560, PCT/AU00/00561, PCT/AU00/00562, PCT/AU00/00563,
PCT/AU00/00564, PCT/AU00/00565, PCT/AU00/00566, PCT/AU00/00567,
PCT/AU00/00568, PCT/AU00/00569, PCT/AU00/00570, PCT/AU00/00571,
PCT/AU00/00572, PCT/AU00/00573, PCT/AU00/00574, PCT/AU00/00575,
PCT/AU00/00576, PCT/AU00/00577, PCT/AU00/00578, PCT/AU00/00579,
PCT/AU00/00581, PCT/AU00/00580, PCT/AU00/00582, PCT/AU00/00587,
PCT/AU00/00588, PCT/AU00/00589, PCT/AU00/00583, PCT/AU00/00593,
PCT/AU00/00590, PCT/AU00/00591, PCT/AU00/00592, PCT/AU00/00594,
PCT/AU00/00595, PCT/AU00/00596, PCT/AU00/00597, PCT/AU00/00598,
PCT/AU00/00516, PCT/AU00/00517 and PCT/AU00/00511.

Description of the Prior Art 

US 5237628 describes an optical recognition system that is able to recognise machine
printed, but not hand written characters, to locate the form fields in the digital image by
locating the machine printed field identifiers. Once a field has been identified, offline
handwritten character recognition is used to recognise individual characters in each field.

US 5455872 discloses a field based recognition system which is able to select the optimum
type of classifier (e.g. constrained handprint, unconstrained handprint, unconstrained
cursive writing) for use with a particular field in a form. The system uses an adaptive
weighting system and confidence values to determine the best classifier to use.

US 5235654 describes a system which incorporates form definition capabilities with a
character recognition processor.

SiberSystems offers a product utilising a form definition language that uses Artificial
Intelligence techniques to deduce different field types that appear on a form.

Summary of the present invention 

In a broad form, the present invention provides a method of interpreting data input to a
form-based data entry system, including decoding data entered into a particular form field
such that its information content can be determined, said information content being in a
consistent machine-readable format, wherein said decoding of data includes determining
one or more possible values of information content, certain pre-defined possible outcomes
being given a relatively higher probability of being correct, and said pre-defined possible
outcomes being dependent on the context of the particular form field.

Preferably, said decoding of data is performed on written or voice data. 


Said decoding may be performed online, where the decode takes place contemporaneously 
with the data entry, or offline, where the decode takes place some time after data entry. 

Preferably, a particular form field has associated with it a predefined dictionary of possible
decoded data, and said dictionary may be used to constrain the decode process such that a
particular decode either has to reside in the dictionary, or that there should at least be a
certain probability that it does.

Preferably, certain possible decodes can be given a higher probability of being correct. An
example of this might be a name field, where Smith has a higher chance of being the
correct decode than Smithfield.

Embodiments of the present invention offer advantages in that more successful recognition
of data input can be achieved in natural language systems by decoding the data input based
on the context of the field in which the data is entered.

Brief Description of the Drawings 

For a better understanding of the present invention and to understand how the same may be
brought into effect, the invention will now be described by way of example only, with
reference to the appended drawings in which:

Figure 1 shows a typical form having two input fields; 

Figure 2 shows another typical form having two different input fields; and

Figures 3a and 3b show two different but similar handwriting samples.
Detailed Description of the Preferred Embodiments

In the preferred embodiment, the invention is configured to work with the Netpage
networked computer system, a detailed description of which is given in our co-pending
applications, including in particular PCT application WO0242989 entitled "Sensing
Device" filed 30 May 2002, PCT application WO0242894 entitled "Interactive Printer"
filed 30 May 2002, PCT application WO0214075 "Interface Surface Printer Using
Invisible Ink" filed 21 February 2002, PCT application WO0242950 "Apparatus For
Interaction With A Network Computer System" filed 30 May 2002, and PCT application
WO03034276 entitled "Digital Ink Database Searching Using Handwriting Feature
Synthesis" filed 24 April 2003. It will be appreciated that not every implementation will
necessarily embody all or even most of the specific details and extensions described in
these applications in relation to the basic system. However, the system is described in its
most complete form to assist in understanding the context in which the preferred
embodiments and aspects of the present invention operate.

In brief summary, the preferred form of the Netpage system provides an interactive
paper-based interface to online information by utilizing pages of invisibly coded paper and
an optically imaging pen. Each page generated by the Netpage system is uniquely identified
and stored on a network server, and all user interaction with the paper using the Netpage
pen is captured, interpreted, and stored. Digital printing technology facilitates the
on-demand printing of Netpage documents, allowing interactive applications to be developed.
The Netpage printer, pen, and network infrastructure provide a paper-based alternative to
traditional screen-based applications and online publishing services, and support
user-interface functionality such as hypertext navigation and form input.

Typically, a printer receives a document from a publisher or application provider via a
broadband connection. The document is printed with an invisible pattern of infrared tags,
each of which encodes the location of the tag on the page and a unique page identifier. As
a user writes on the page, the imaging pen decodes these tags and converts the motion of
the pen into digital ink. The digital ink is transmitted over a wireless channel to a relay
base station, and then sent to the network for processing and storage. The system uses a
stored description of the page to interpret the digital ink, and performs the requested
actions by interacting with an application.

Applications provide content to the user by publishing documents, and process the digital
ink interactions submitted by the user. Typically, an application generates one or more
interactive pages in response to user input, which are transmitted to the network to be
stored, rendered, and finally printed as output to the user. The Netpage system allows
sophisticated applications to be developed by providing services for document publishing,
rendering, and delivery, authenticated transactions and secure payments, handwriting
recognition and digital ink searching, and user validation using biometric techniques such
as signature verification.

Embodiments of the present invention are operable in either on-line or off-line situations to
decode natural language input data. Such input data can take the form of handwriting,
spoken words or other non-constrained forms of input.

For the purposes of this description, 'on-line' refers to systems where the input data is
decoded in real-time, i.e. contemporaneously with the input of the data. In other words, the
decoding process is able to work with dynamic information, such as the trajectory of the
various strokes which make up a written character. A typical on-line system is an Internet
web page, where the input is accepted, for instance, in the form of handwritten characters
entered via means of a stylus and a suitable graphics tablet.

For the purposes of this description, 'off-line' refers to systems where the input data is
recorded, but the decoding does not occur until some time later. In other words, the
decoding is only able to work with a static representation of the input, such as a bitmap
image of a written character. A typical off-line system is a handwritten form data capture
system where a user completes a form using handwriting and regular pen, and at a later
time, the completed form is scanned and processed to extract the data encoded therein.

As has been noted, the use of natural language input systems poses a number of problems
for system designers. There is a great range of different writing styles, both from person to
person, and even for the same person on different occasions or using different writing
implements. Likewise, there is a wide variety of accents, intonations, dialects and pitches
of voices, each making it difficult to distinguish voice input from different speakers.

Embodiments of the present invention provide a method for improving recognition
accuracy in a variety of natural language data input systems. The improvement is achieved
by constraining the set of possible data which may be entered in a particular field, based on
certain attributes of the field itself. In one embodiment, the constraint may be absolute, in
that the data entered in the field must be found in a defined set of data associated with that
field.

In other embodiments, the constraint may be partial, in that a greater weighting is given to
data input which is found in a defined set of data. In these cases, if a data entry is decoded
and found not to reside in the list of higher-weighted outcomes, it is still accepted, whereas
in the previous embodiment, such a result would be discounted.
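The difference between an absolute and a partial constraint can be sketched as follows. This is a minimal illustration only: the function name `constrain`, the `boost` factor and the candidate scores are assumptions made for the example and are not taken from the specification.

```python
def constrain(candidates, dictionary, absolute=False, boost=3.0):
    """Re-rank recogniser candidates against a field dictionary.

    candidates: list of (text, score) pairs from a recogniser (assumed input).
    absolute:   True  -> results outside the dictionary are discarded;
                False -> all results are kept, dictionary hits are up-weighted.
    boost:      hypothetical weighting factor for dictionary hits.
    """
    ranked = []
    for text, score in candidates:
        in_dictionary = text.lower() in dictionary
        if absolute and not in_dictionary:
            continue  # absolute constraint: out-of-dictionary result is discounted
        ranked.append((text, score * boost if in_dictionary else score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


first_names = {"anna", "john"}
candidates = [("Arma", 0.5), ("Anna", 0.4), ("Anno", 0.1)]

strict = constrain(candidates, first_names, absolute=True)
loose = constrain(candidates, first_names, absolute=False)
```

With the absolute constraint only the dictionary entry survives; with the partial constraint every candidate is retained but the dictionary entry moves to the top of the ranking.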

In a form-based data entry system, the form includes one or more fields, each of which is
able to receive a data entry. In the following description, for convenience, embodiments of
the invention will primarily be described in terms of a system arranged to receive
handwritten input, but the skilled man will realise that other forms of data input, such as
speech, can also benefit from embodiments of the invention.

Figure 1 shows a typical form 100 which is intended to capture name information from two
separate fields 110, 120. The field 110 labelled 'First Name' is provided to capture an input
from a user giving his first name. The second field 120, labelled 'Last Name' is provided to
capture an input from a user giving his last name.

In the first case, the associated processing system, whether on-line or off-line, is able to
decode the input data, and constrain the likely results on the basis of information implicit
in the field label, 'First Name'. The processing system is provided with a database of
common first names such that when the handwritten input is decoded, a greater weighting
is given to possible values of the decoded input which reside in the database of common
first names. As an example, a particular user may be called 'Greg'. However, in his
particular writing style, his name may appear to resemble 'Grey'.

Figure 3a shows a graphic representation of a user's rendering of his first name in a form
field. Figure 3b shows how the same user would render the word 'Grey', and it is noticeable
that the two representations are very similar, and differ only in the closed upper portion of
the final letter 'g' in 'Greg' when compared to the 'y' of 'Grey'.

When the processing system seeks to decode and interpret the written input, a greater
weighting is given to 'Greg', as this is far more likely to be a valid first name. Note that in
this case, 'Grey' is a word which is to be found in a dictionary of acceptable words, but is
unlikely to feature in a list of common first names. In this way, constraining the data by
giving preference to common names over other valid words has produced the correct
result. In other cases, where two or more results are likely and all appear in the constrained
list, the user may be prompted to re-enter the data, or be presented with an option to choose
the correct one of the possible results from a list of the probable results.
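The fallback to user confirmation when several constrained results remain likely might look like the sketch below. The `margin` threshold and the return convention are assumptions for illustration, and the scores are taken to be already weighted by the field dictionary.

```python
def resolve(candidates, margin=0.2):
    """Accept the best candidate if it clearly wins, otherwise ask the user.

    candidates: list of (text, weighted_score) pairs.
    margin:     hypothetical score gap below which results count as ambiguous.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    best_text, best_score = ranked[0]
    close = [text for text, score in ranked if best_score - score <= margin]
    if len(close) == 1:
        return ("accept", best_text)
    return ("ask_user", close)  # present the probable results as a list


clear = resolve([("Greg", 0.9), ("Grey", 0.3)])
ambiguous = resolve([("Greg", 0.5), ("Craig", 0.45)])
```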

The same process can be adapted for different fields likely to be found in different forms.
The non-exhaustive exemplary list below details several fields and the kinds of constraints
which may be applied to the decoding process to improve the likelihood of generating the
correct outcome from a given input. The ordinary skilled person will, of course, realise that
different fields may have contextual constraints applied to them according to their
particular properties.



WO 2004/036488                                                      PCT/AU2003/001341

Field Label String
    Context Processing

First Name, Given Name, etc.
    Large lists of common first names are widely and publicly available for use as
    dictionaries defining processing constraints during recognition. These lists, which
    are often derived from census data, include associated a-priori probabilities,
    allowing common names such as "John" and "David" to be more frequently matched.
    If additional information from the form or elsewhere is available that indicates
    the gender of the writer, separate male and female lists can be used to further
    improve recognition accuracy.

    Note that during recognition, out-of-vocabulary words (i.e. names that do not
    appear in the name dictionary) can be allowed to ensure that unusual and uniquely
    spelled names can still be recognised correctly. This can be done by combining the
    dictionary decoding with a probabilistic grammar model (such as a character
    n-gram) that contains information regarding the a-priori probability of character
    sequences usually found in names.

Last Name, Surname, Family Name, etc.
    Similar to the above field, but using a dictionary of last names. Note that for
    Western names, there is generally much greater variability of last names across
    the population, so the probability of out-of-vocabulary words must be higher than
    that for first name recognition.

Address
    Most addresses follow a regular pattern (e.g. dwelling number, followed by street
    name and street type). The recognition system can exploit this pattern during
    decoding by, for example, using regular expression matching, or by altering the
    valid character set (i.e. digits only, letters only, '/' allowed or not allowed,
    etc.) as recognition proceeds.

    In addition to this, some elements in the address can be decoded with the
    assistance of a dictionary, such as street type ("Street", "Road", "Place",
    "Avenue", "Crescent", "Square", "Hill" etc.) or street names (common street names
    include "Main", "Church", "North", "High", etc.)

Suburb, Town, etc.
    Full lists of suburbs and towns are freely and publicly available for most
    regions. This information can be used in conjunction with other information such
    as state or postcode / zipcode information (if available) to further reduce the
    recognition alternatives.

    For instance, if it has already been established that the country of residence is
    e.g. Australia, then there are only seven possible values for the next
    hierarchical division of state or territory. Once that field has been decoded, a
    further constraining dictionary of suburbs or towns in that state/territory can be
    used to limit the possible outcomes.

State
    Lists of states are available if the Country/Region is known. Each state can be
    given an a-priori probability corresponding to the likelihood that a person is
    from that state (i.e. large, populous states can be given a higher a-priori
    probability). Further constraints can be used if postcode / zipcode is known.

Phone Number
    Phone numbers follow a regular pattern (e.g. "(##) ####-####") that can be used
    during recognition. Additionally, the valid character set for a phone number is
    constrained to numbers only, further restricting the potential recognition
    alternatives.

Zip/Postal Code
    Zip/Postal codes within a given country generally follow a specific pattern. For
    example: in Australia, the postal codes are always four digits long; in the USA,
    five digits; and in the UK, a mix of one or more letters, followed by two or more
    numbers, followed by one or more letters again. Additional decoding constraints
    are available if the corresponding State and Suburb information is available.

Country, Region, etc.
    Full lists of possible Country/Region labels are publicly available.

Birth Date, Date of Birth, Other dates etc.
    Written dates generally follow a regular pattern, and have a constrained character
    set consisting of either numbers alone or numbers and delimiting characters such
    as '-' or '/'.

Email, E-Mail, Email Address, etc.
    Email addresses follow a specific pattern and have a well-specified character set.
    An example regular expression that can be used to match email addresses is
    "/^([a-zA-Z0-9_\.\-])+\@(([a-zA-Z0-9\-])+\.)+([a-zA-Z0-9])+$/".

    In addition to this, if email contact information is available for a user (e.g.
    using Microsoft Windows Messaging API (MAPI)), the list of email addresses can be
    used as a dictionary during recognition. Similarly, common email domain names
    (e.g. "hotmail.com", "yahoo.com", "email.com", etc.) can be used as dictionary
    entries to guide recognition.

Credit Card, Credit Card Number, etc.
    Credit card numbers have a specific format (e.g. "####-####-####-####") and
    constrained character set. Additionally, there are often validation rules (e.g.
    check digit tests) that can also be used during recognition. For example, if there
    are two equi-probable results for the recognition of a credit-card number, check
    digit validation may be helpful in selecting the correct result.

Language / Locale
    Lists of languages that are spoken around the world are freely available, and are
    currently used by many web forms. Once the language of a particular writer is
    known, it can be used to improve the processing of other types of input. Examples
    of this include different language-specific dictionaries (e.g. English, German,
    French, etc.) for text recognition, changing the valid recognition character set
    (e.g. allowing accented letters that are used by some Western European languages),
    and changing the format for date recognition.
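As an illustration of how a field label could select a pattern constraint of the kind listed above, the following sketch maps label keywords to regular expressions and filters recognition alternatives against them. The keyword table, the function names and the substring matching rule are assumptions made for this example; only the pattern shapes come from the list.

```python
import re

# Hypothetical keyword -> pattern table; the formats follow the examples in the list.
FIELD_PATTERNS = {
    "phone": re.compile(r"\(\d{2}\) \d{4}-\d{4}"),  # "(##) ####-####"
    "postcode": re.compile(r"\d{4}"),               # Australian postal code: four digits
    "zip": re.compile(r"\d{5}"),                    # US zip code: five digits
}


def pattern_for(label):
    """Return the pattern whose keyword occurs in the field label, if any."""
    lowered = label.lower()
    for keyword, pattern in FIELD_PATTERNS.items():
        if keyword in lowered:
            return pattern
    return None


def filter_by_label(label, candidates):
    """Keep only the recognition alternatives that fit the field's pattern."""
    pattern = pattern_for(label)
    if pattern is None:
        return list(candidates)  # no known constraint for this label
    return [text for text in candidates if pattern.fullmatch(text)]


phones = filter_by_label("Phone Number", ["(02) 9555-1234", "(O2) 9555-l234"])
codes = filter_by_label("Postcode", ["2041", "Z04I", "20411"])
```

A fuller system would presumably combine such pattern filters with the dictionaries and a-priori probabilities described for the other field types.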

In addition to using publicly available or proprietary dictionaries, particular field labels
may compile their own dictionaries over time, using previously recognised responses to
guide and constrain future data entries. In this way, systems employing embodiments of
the invention can improve their recognition capabilities as they operate over time and
'learn' more possible outcomes of the decode process. In this way, names which become
more popular over time, for instance, can be given a higher a priori weighting.
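One way a field could accumulate its own dictionary is sketched below. The class name, the add-one smoothing and the sample names are assumptions chosen for illustration; the specification does not prescribe how the a priori weighting is computed.

```python
from collections import Counter


class LearnedDictionary:
    """Per-field dictionary built from previously recognised responses."""

    def __init__(self):
        self.counts = Counter()

    def record(self, value):
        """Store one accepted recognition result for this field."""
        self.counts[value.lower()] += 1

    def prior(self, value, smoothing=1.0):
        """Smoothed a priori weighting; unseen values keep a small non-zero prior."""
        total = sum(self.counts.values())
        vocabulary = len(self.counts) + 1  # one extra slot for unseen values
        return (self.counts[value.lower()] + smoothing) / (total + smoothing * vocabulary)


names = LearnedDictionary()
for accepted in ["Olivia", "Olivia", "Olivia", "Greg"]:
    names.record(accepted)
```

Values entered more often receive a higher prior, while values never seen before are still admitted with a low weighting.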

Most form definition formats support a number of different field types, such as text fields,
selection list fields, combination fields (i.e. a field that combines text input with a selection
list), signature fields, checkboxes, buttons, and so on. The field type gives some indication
of the expected input data-type (e.g. a text input field indicates text entry). If a document
format allows data-types to be explicitly defined (e.g. XML/XForms), a recognition system
can use this information to constrain the recognition process.
In addition to the field type, forms often contain information regarding the type of data that
should be entered in each field. This information is usually contained in attributes that are
associated with a specific field. One example of this is the set of selection strings that are
commonly associated with list input fields. These strings represent the options from which
the user must make a selection, and can be used as dictionary elements during recognition.
Similarly, recognition of combination fields can use a dictionary of selection strings in
combination with a character grammar to allow words other than those listed in the option
list to be recognized.

Standard input fields may also contain attributes that can assist in the recognition
procedure. For example, some input field types have a flag indicating that the value
entered must be numeric, signifying to the recognition system that the recognised character
set should only include digits. Input fields may also contain a mask attribute, which is a
string indicating that the input must match the specified pattern (e.g. "####AA" requiring
that four digits followed by two upper-case alphabetic letters be entered such as
"2002CY"). This mask can be used to constrain the valid recognition character set at each
offset in the string and thus improve the recognition accuracy.
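A sketch of mask-driven decoding is given below, assuming a recogniser that supplies scored character alternatives for each position. Treating any mask symbol other than '#' and 'A' as a literal is an assumption of this example, as are the function names and scores.

```python
import string

# '#' permits digits, 'A' permits upper-case letters (per the "####AA" example).
MASK_CLASSES = {"#": set(string.digits), "A": set(string.ascii_uppercase)}


def allowed_at(mask, offset):
    """Valid character set at one offset; other symbols are literals (assumption)."""
    symbol = mask[offset]
    return MASK_CLASSES.get(symbol, {symbol})


def decode_with_mask(mask, alternatives):
    """alternatives[i] is a list of (char, score) guesses for position i."""
    if len(alternatives) != len(mask):
        return None
    decoded = []
    for offset, guesses in enumerate(alternatives):
        valid = [(ch, score) for ch, score in guesses if ch in allowed_at(mask, offset)]
        if not valid:
            return None  # nothing at this offset satisfies the mask
        decoded.append(max(valid, key=lambda guess: guess[1])[0])
    return "".join(decoded)


alternatives = [
    [("2", 0.9), ("Z", 0.8)],
    [("O", 0.6), ("0", 0.5)],  # 'O' scores higher but is not a digit
    [("0", 0.7), ("O", 0.6)],
    [("Z", 0.6), ("2", 0.5)],
    [("C", 0.9)],
    [("Y", 0.7), ("4", 0.6)],
]
decoded = decode_with_mask("####AA", alternatives)
```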

Many forms specify validation parameters that can be used to guide the recognition
process. For example, numeric input fields may specify minimum and maximum values
that can be used to constrain the recognition results. Other fields may contain validation
program code (e.g. JavaScript) that is executed when the user has entered a value into the
field. This code can be executed multiple times, with each individual recognition result as
a parameter, allowing potential alternative results that do not conform to the validation
requirements to be discarded.
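Running a validation routine once per recognition alternative can be sketched as follows. The Luhn check is used here as a stand-in validator because it is a widely used check-digit scheme for card numbers; the specification does not name a particular test, and the function names are assumptions.

```python
def luhn_valid(number):
    """Luhn check-digit test over the digits of a string (separators ignored)."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def keep_valid(candidates, validator):
    """Run the field's validator on each alternative and drop the failures."""
    return [text for text in candidates if validator(text)]


alternatives = ["4111-1111-1111-1117", "4111-1111-1111-1111"]
surviving = keep_valid(alternatives, luhn_valid)
```

The same `keep_valid` helper could take a range check for numeric fields, for instance `lambda text: text.isdigit() and 1 <= int(text) <= 31`.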

In addition to using standard form field attributes to improve the recognition process,
recognition-specific information can be added to fields using custom attributes. This
information is only used if the form input is processed using a recognition system. Thus,
the form can still be used normally where required (e.g. data entry using a keyboard via a
web browser) since the custom attributes are ignored; however, if recognition is required,
the custom parameters can be used to improve the recognition results.

Some examples of custom field attributes include character set definition (where the set of
valid characters for a field is explicitly defined) and regular expressions. If the fields are
displayed or printed using visual cues to guide character spacing (e.g. boxes on forms
where each box must contain a single character), the parameters of the guide can be
associated with the field as custom attributes to assist with the character segmentation
stage of the handwriting recognition. For example, by specifying the coordinates of the
bounding rectangle and the number of rows and columns in a field that uses character
boxes for input, the recognition system can be informed of the expected location of each
character, allowing more accurate recognition to occur.

Information regarding context processing and language modelling can also be encoded in
custom attributes. Some handwriting recognition systems use a combination of language
models to assist in the recognition of handwritten text (e.g. n-gram character models,
standard dictionaries, user-specific dictionaries). These models are usually combined using
a set of weightings that indicate the likelihood that an input word will be decoded correctly
using each of the specified models. However, the most accurate results are produced when
the weightings can be customised depending on the expected input. By including the
language model weights as a custom attribute for a field, more accurate recognition can be
achieved by tuning the model weights on a per form or even per field basis.
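A per-field weighting of language models might be sketched as below. The attribute syntax "name:weight,name:weight", the linear mixing rule and the placeholder n-gram model are all assumptions made for illustration; a real recogniser may combine its models differently.

```python
def parse_weights(attribute_value):
    """Parse a hypothetical custom attribute such as 'dictionary:0.8,char_ngram:0.2'."""
    weights = {}
    for item in attribute_value.split(","):
        name, value = item.split(":")
        weights[name.strip()] = float(value)
    return weights


def combined_score(word, models, weights):
    """Weighted linear mix of the per-model probabilities for one candidate word."""
    total_weight = sum(weights.values())
    return sum(weights[name] * models[name](word) for name in weights) / total_weight


dictionary = {"greg", "john"}
models = {
    "dictionary": lambda word: 1.0 if word.lower() in dictionary else 0.0,
    "char_ngram": lambda word: 0.5,  # placeholder; a real n-gram model scores the characters
}

name_field = parse_weights("dictionary:0.8,char_ngram:0.2")  # dictionary-heavy field
free_text = parse_weights("dictionary:0.3,char_ngram:0.7")   # grammar-heavy field
```

Under the name-field weights an out-of-dictionary word scores much lower than under the free-text weights, which is the tuning effect the paragraph above describes.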

To allow more control over the recognition procedure, custom validation program code
(e.g. JavaScript) can be associated with a field that is executed on each potential result
after the handwriting recognition procedure has completed, allowing the most appropriate
result to be selected. However, rather than using a Boolean validation function (i.e. a string
is either valid or invalid), the function can return a confidence value that indicates the
probability that the string would be entered. This probability can be combined with the
results of the character classification procedure to select the most probable recognition
result. In this way, even if a decoded result has a low confidence value associated with it, it
may still be accepted by the system if other checks confirm that it is a valid response. A
simple Boolean approach may result in valid inputs being discounted.
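The contrast with a Boolean validator can be illustrated with a confidence-returning function whose output is multiplied with the classifier probability. The birth-year field, its year range and the confidence values are hypothetical choices for this sketch.

```python
def year_confidence(text):
    """Hypothetical validator for a birth-year field, returning a probability."""
    if not text.isdigit() or len(text) != 4:
        return 0.0  # cannot be a year at all
    if 1900 <= int(text) <= 2003:
        return 0.9  # plausible birth year
    return 0.1      # unusual, but not rejected outright


def select(candidates, confidence):
    """Pick the candidate maximising classifier probability times field confidence."""
    return max(candidates, key=lambda pair: pair[1] * confidence(pair[0]))[0]


typical = select([("1978", 0.4), ("1878", 0.45), ("l978", 0.5)], year_confidence)
unusual = select([("1878", 0.9), ("1978", 0.05)], year_confidence)
```

In the second call the low-confidence value "1878" is still selected because the classifier evidence for it is strong, whereas a Boolean range check would have discarded it.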
An improvement to this scheme is to define a language model probability function that is
called by the recogniser as each character is recognised by the system. This allows a
recognition system to prune unlikely or invalid recognition strings early in the recognition
procedure, allowing long strings of text to be recognised efficiently. During the recognition
procedure, a large number of potential results are produced by considering different
combinations of recognised characters. Typically, there are a large number of potential
character alternatives for each letter position, so for text of even moderate length, there are
a large number of alternatives. As a result, recognition systems generally use a beam
search technique, such that the n best alternatives at each letter position are considered,
where n is typically between 10 and 100. Thus, the n most likely results at each position
are stored, with the remainder discarded.



However, to select the n best results at each step requires validation from the language
model at each step rather than after the recognition procedure has completed, otherwise
high-scoring strings that are impossible or unlikely as defined by the language model may
be retained while valid but lower-scoring strings are discarded. As a result, the improved
language model function should be able to calculate and return a sub-string probability, so
that the recogniser can combine the character classification probability with the sub-string
probability at each step, and thus select the n most likely strings. This flexible approach
allows almost any language model, including dictionaries and character Markov-models, to
be implemented.
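The interaction between the beam search and a sub-string probability function can be sketched as follows. The dictionary-prefix language model, the function names and the character scores are assumptions for this example; any function returning a sub-string probability could be passed in its place.

```python
def beam_search(position_alternatives, substring_probability, beam_width=10):
    """Keep the n best prefixes per position, scored with the language model.

    position_alternatives: per position, a list of (char, classifier_probability).
    substring_probability: callable returning the language-model probability of a prefix.
    """
    beams = [("", 1.0)]  # (prefix, accumulated classifier probability)
    for alternatives in position_alternatives:
        scored = []
        for prefix, class_prob in beams:
            for char, char_prob in alternatives:
                extended = prefix + char
                joint = class_prob * char_prob
                scored.append((extended, joint, joint * substring_probability(extended)))
        scored = [item for item in scored if item[2] > 0.0]  # prune impossible strings
        scored.sort(key=lambda item: item[2], reverse=True)
        beams = [(text, joint) for text, joint, _ in scored[:beam_width]]
    return [text for text, _ in beams]


def dictionary_prefix_model(words):
    """Language model giving 1.0 to any prefix of a dictionary word, else 0.0."""
    return lambda prefix: 1.0 if any(word.startswith(prefix) for word in words) else 0.0


positions = [
    [("g", 0.9), ("q", 0.1)],
    [("r", 0.8), ("a", 0.2)],
    [("e", 0.9), ("c", 0.1)],
    [("y", 0.6), ("g", 0.4)],
]
with_model = beam_search(positions, dictionary_prefix_model({"greg", "gary"}), beam_width=2)
without_model = beam_search(positions, lambda prefix: 1.0, beam_width=2)
```

With the dictionary model consulted at every step the search returns "greg"; with a language model that accepts everything, the classifier's slight preference for the final 'y' puts "grey" first.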



The following part describes how data may be extracted for various commonly used form
definition formats, including HTML, XForms and PDF (Adobe Portable Document
Format).



Hypertext Mark-up Language (HTML) is a standard set of mark-up symbols used to define
the format of a page of text and graphics that is intended for display in a World Wide Web
browser. HTML is a formal recommendation by the World Wide Web Consortium (W3C)
and is defined in the W3C "HTML 4.01 Specification" of 24 December 1999. XHTML, a
reformulation of HTML as an XML application, is very similar to HTML and is defined in
the W3C "XHTML 1.0 The Extensible HyperText Markup Language (Second Edition)" of
1 August 2002. Similarly, SGML is defined in the ISO "Information Processing
- Text and office systems - Standard Generalised Markup Language (SGML)", ISO 8879
of 1986.



Some example HTML code for a form is given below (an example of the output that this
code might generate in a browser is given in Figure 1).



<html>
<form ACTION="cgi-bin/form.exe" METHOD=post>
<p><b>Please Enter Your Name</b></p>
<p>First Name: <INPUT TYPE="TEXT" NAME="FirstName" CUSTOM="Hello"></p>
<p>Last Name: <INPUT TYPE="TEXT" NAME="LastName"></p>
<p><INPUT TYPE="SUBMIT" NAME="Submit"></p>
</form>
</html>



Usually, field labels associated with input fields can be easily derived from the HTML 
document source. Generally, field labels appear as normal text immediately before the 
input field definition (as shown above). In other situations, the layout of the rendered 
document can be analysed to determine which text labels should be associated with which 
input fields (for example, when a table is used for form layout). Additionally, the "name" 
attribute that is associated with many input elements may contain text that will allow the 
field type to be determined. 
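
The simplest case described above, where the label is the normal text immediately before the input field, can be sketched as follows. This is only an illustration under stated assumptions, not the method of the invention: the class LabelExtractor and the function extract_labels are hypothetical names, the form text mirrors the example form given earlier, and a new <p> element is assumed to start a new label.

```python
from html.parser import HTMLParser


class LabelExtractor(HTMLParser):
    """Pair each text INPUT with the text that immediately precedes it."""

    def __init__(self):
        super().__init__()
        self.pending_text = ""   # text seen since the last field or paragraph start
        self.fields = []         # list of (name attribute, label text)

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "p":
            self.pending_text = ""          # assumption: a new <p> starts a new label
        elif tag == "input":
            attr = dict(attrs)
            if (attr.get("type") or "").lower() == "text":
                label = self.pending_text.strip().rstrip(":").strip()
                self.fields.append((attr.get("name"), label))
            self.pending_text = ""

    def handle_data(self, data):
        self.pending_text += data


def extract_labels(source):
    parser = LabelExtractor()
    parser.feed(source)
    parser.close()
    return parser.fields


FORM = """
<html>
<form ACTION="cgi-bin/form.exe" METHOD=post>
<p><b>Please Enter Your Name</b></p>
<p>First Name: <INPUT TYPE="TEXT" NAME="FirstName" CUSTOM="Hello"></p>
<p>Last Name: <INPUT TYPE="TEXT" NAME="LastName"></p>
<p><INPUT TYPE="SUBMIT" NAME="Submit"></p>
</form>
</html>
"""

for name, label in extract_labels(FORM):
    print(name, "->", label)
```

Table-based layouts, where the label and the field sit in different cells, would need the layout analysis mentioned above and are not handled by this sketch.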



Standard HTML contains a number of element attributes that can usefully serve as hints 
to a recognition system. Some examples include: 

• the "maxlength" attribute of an INPUT element that can be used to limit the length 
of the recognised text, 

• the OPTION elements associated with a SELECT element that represent the set of 
valid input strings (which can be used as dictionary entries during recognition), and 

• the "rows" and "cols" attributes in a TEXTAREA element that could be used to 
define a character spacing guide (e.g. boxed input where each letter must be written 
in a separate box). 



In addition to this, custom attributes can be easily added to HTML field elements (e.g. 
CUSTOM="Hello"), since browsers and other systems processing a page must ignore 
attributes that are unknown. In this way a form designer can add custom elements to 
HTML source code which will only be used by recognition systems and will safely be 
ignored by 'dumb' browsers. 
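
As a rough sketch of how such hints might be collected, the following gathers "maxlength", "rows" and "cols" values, OPTION strings and unknown attributes for each field. The names HintExtractor, extract_hints and STANDARD are hypothetical, the sample form (including the MAXLENGTH value, the Title list and the Address text area) is invented for illustration, and treating every attribute outside a short list as custom is a simplifying assumption.

```python
from html.parser import HTMLParser

# Assumption: a short list of standard form-control attributes; anything
# else found on a field is treated as a custom recognition hint.
STANDARD = {"type", "name", "value", "size", "maxlength", "rows", "cols",
            "multiple", "checked", "disabled", "readonly"}


class HintExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hints = {}           # field name -> dict of hints
        self._select = None       # dictionary list of the SELECT being parsed
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag in ("input", "select", "textarea"):
            hint = self.hints.setdefault(attr.get("name"), {})
            for key in ("maxlength", "rows", "cols"):
                value = attr.get(key) or ""
                if value.isdigit():
                    hint[key] = int(value)
            custom = {k: v for k, v in attr.items() if k not in STANDARD}
            if custom:
                hint["custom"] = custom
            if tag == "select":
                self._select = hint.setdefault("dictionary", [])
        elif tag == "option" and self._select is not None:
            self._select.append("")
            self._in_option = True

    def handle_data(self, data):
        if self._in_option:
            self._select[-1] += data      # accumulate the option text

    def handle_endtag(self, tag):
        if tag == "option" and self._in_option:
            self._select[-1] = self._select[-1].strip()
            self._in_option = False
        elif tag == "select":
            self._select = None
            self._in_option = False


def extract_hints(source):
    parser = HintExtractor()
    parser.feed(source)
    parser.close()
    return parser.hints


FORM = """
<form>
<INPUT TYPE="TEXT" NAME="FirstName" MAXLENGTH="20" CUSTOM="Hello">
<SELECT NAME="Title"><OPTION>Mr</OPTION><OPTION>Ms</OPTION><OPTION>Dr</OPTION></SELECT>
<TEXTAREA NAME="Address" ROWS="3" COLS="25"></TEXTAREA>
</form>
"""

for name, hint in extract_hints(FORM).items():
    print(name, hint)
```

A recogniser could then cap the length of its output at the "maxlength" value, restrict a list field to its dictionary entries, and lay out a rows-by-cols grid of character boxes for a text area.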

XForms is a standard form definition language defined by W3C and described in 
"XForms 1.0" W3C Working Draft of 21 August 2002. The XForms standard has been 
developed as a successor to HTML forms, and implements device-independent form 
processing by allowing the same form to operate on desktop computers, hand-held devices, 
information appliances, and even paper. To achieve this, XForms ensures that, unlike 
HTML, data definitions are kept separate from presentation. An example of XForms code 
is given below. An example of the output that this code might generate in a browser is 
given in Figure 2. 






<xform> 
<submitInfo action="form.exe" method="post"/> 
</xform> 

<input xform="payment" ref="cc"> 
<caption>Credit Card Number</caption> 
</input> 
<input xform="payment" ref="exp"> 
<caption>Expiration Date</caption> 
</input> 
<submit xform="payment"> 
<caption>Submit</caption> 
</submit> 



In a similar manner to HTML, field labels can be derived from the XForms code by 
examining the caption element in the input field definitions. In addition to this, XForms 
supports input field elements similar to those described previously for HTML, including 
the list selection elements "<selectOne>" and "<selectMany>" and associated "<item>" 
elements that can be used as dictionary entries during recognition processing. 
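
A minimal sketch of this caption and item extraction is given here, using a standard XML parser. Several details are assumptions made for illustration only: the function name extract_fields, the wrapping <root> element (added because the fragment has several top-level elements), the "Card Type" list, and the supposition that each <item> carries its display text in a <caption> child. XML namespaces are omitted, as in the fragment shown earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical XForms fragment wrapped in <root> so that it parses as one document.
XFORMS = """
<root>
  <xform>
    <submitInfo action="form.exe" method="post"/>
  </xform>
  <input xform="payment" ref="cc">
    <caption>Credit Card Number</caption>
  </input>
  <input xform="payment" ref="exp">
    <caption>Expiration Date</caption>
  </input>
  <selectOne xform="payment" ref="type">
    <caption>Card Type</caption>
    <item><caption>Visa</caption></item>
    <item><caption>MasterCard</caption></item>
  </selectOne>
</root>
"""


def extract_fields(source):
    """Map each field's ref to its label and any dictionary entries."""
    fields = {}
    for elem in ET.fromstring(source):
        if elem.tag not in ("input", "selectOne", "selectMany"):
            continue
        label = elem.findtext("caption", default="").strip()
        # Assumption: an item's caption holds the string the user would write.
        entries = [item.findtext("caption", default="").strip()
                   for item in elem.findall("item")]
        fields[elem.get("ref")] = {"label": label, "dictionary": entries}
    return fields


for ref, info in extract_fields(XFORMS).items():
    print(ref, info)
```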

The XForms specification includes a set of data-types for field input, including date, 
money, number, string, time, and URI types. This information can be used by a recognition 
system to improve recognition accuracy. Similarly, the specification includes data 
attributes (e.g. currency, decimal places, integer) and validation attributes (minimum value, 
maximum value, pattern, range), which can be used to further improve recognition results. 
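
One way such validation attributes might be applied is to discard recognition candidates that violate them before choosing a result. The sketch below assumes a recogniser that returns scored candidate strings; FieldConstraints, is_valid and best_candidate are hypothetical names, and the candidate list and scores are invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldConstraints:
    """Hypothetical container for constraints read from a form definition."""
    pattern: Optional[str] = None     # regular expression the whole text must match
    minimum: Optional[float] = None   # smallest permitted numeric value
    maximum: Optional[float] = None   # largest permitted numeric value


def is_valid(text, c):
    if c.pattern is not None and re.fullmatch(c.pattern, text) is None:
        return False
    if c.minimum is not None or c.maximum is not None:
        try:
            value = float(text)
        except ValueError:
            return False
        if c.minimum is not None and value < c.minimum:
            return False
        if c.maximum is not None and value > c.maximum:
            return False
    return True


def best_candidate(candidates, c):
    """Return the highest-scoring candidate that satisfies the constraints."""
    valid = [(score, text) for text, score in candidates if is_valid(text, c)]
    return max(valid)[1] if valid else None


# Invented recogniser output for a handwritten quantity: the letters 'l' and
# 'O' are easily confused with the digits '1' and '0'.
candidates = [("lO", 0.41), ("10", 0.38), ("1O", 0.12), ("700", 0.09)]
quantity = FieldConstraints(pattern=r"\d+", minimum=1, maximum=99)
print(best_candidate(candidates, quantity))   # prints 10
```

Here the top-scoring raw candidate is rejected by the pattern, and "700" is rejected by the range, so the constrained result differs from the unconstrained one.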

Portable Document Format (PDF) is a document format defined by Adobe that has become 
the de-facto standard for Internet-based document distribution. Recently, Adobe has added 
interactive elements that allow the definition of forms for online use. 

Like HTML and XForms, PDF form elements have a specific type (e.g. text, signature, 
combo box, list box) that defines the behaviour of the element and thus can be used as a 
guide for a handwriting recognition system. They also contain a field name (e.g. "/T 
(FirstName)") that may contain a useful label that indicates the type of data to be entered 
into the field. List and combination fields contain a set of options ("/Opt 
[(Option1)(Option2)]") that define the valid selection strings. 
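
The field name and option entries quoted above could be pulled from a field dictionary along the following lines. This is a deliberately simplified sketch: the two sample dictionaries are invented, parse_field is a hypothetical name, and the regular expressions handle only plain literal strings (no escaped or nested parentheses, hexadecimal strings or indirect references), so a real system would use a full PDF parser.

```python
import re

# Invented field dictionaries written in PDF object syntax.
FIELDS = [
    "<< /FT /Tx /T (FirstName) >>",
    "<< /FT /Ch /T (Title) /Opt [(Option1)(Option2)] >>",
]

# Field type names used by the /FT entry of a PDF form field.
FIELD_TYPES = {"Tx": "text", "Ch": "choice", "Btn": "button", "Sig": "signature"}


def parse_field(raw):
    """Extract the type, name and option strings from one field dictionary."""
    ftype = re.search(r"/FT\s*/(\w+)", raw)
    name = re.search(r"/T\s*\(([^)]*)\)", raw)
    opts = re.search(r"/Opt\s*\[(.*?)\]", raw)
    return {
        "type": FIELD_TYPES.get(ftype.group(1), "unknown") if ftype else "unknown",
        "name": name.group(1) if name else None,
        # Each option is a parenthesised literal string inside the /Opt array.
        "options": re.findall(r"\(([^)]*)\)", opts.group(1)) if opts else [],
    }


for raw in FIELDS:
    print(parse_field(raw))
```

The option strings can then serve as dictionary entries for the recogniser, in the same way as OPTION elements in HTML.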

Additional field attributes include a format specifier (e.g. number, percent, date, time, zip 
code, phone number, social security number, etc.) and JavaScript validation code that is 
executed when data has been entered into the field. Custom attributes can also be easily 
incorporated in field definitions, as shown above ("/CUSTOM_ATTRIBUTE 
(HelloWorld)"). 

Embodiments of the present invention may be implemented using a suitably programmed 
and conditioned microprocessor. Such a microprocessor may form part of a custom 
system, specifically designed to operate in a character recognition environment, or it may 
be a general purpose computer, such as a desktop PC, which is also able to perform other 
more general tasks. 

In the light of the foregoing description, it will be clear to the ordinary skilled person that 
various modifications may be made within the scope of the invention. 






The present invention includes any novel feature or combination of features disclosed 
herein either explicitly or any generalisation thereof irrespective of whether or not it relates 
to the claimed invention or mitigates any or all of the problems addressed. 



